Set up: load these packages!

library(caret)
library(DT)
library(tidyverse)
library(e1071)
library(gridExtra)
library(mda)
library(randomForest)

Introduction

To get started, you will need the following dependencies (R packages to install and datasets to download):

  1. Install R and R Studio
  2. Please install the following R packages for Data Management: tidyverse, DT, and gridExtra. In R:
install.packages(c("tidyverse", "DT", "gridExtra"))
  1. Please install the following R packages for Machine learning: umap, mda caret, e1071, rpart, randomForest, neuralnet. In R:
install.packages(c("caret", "e1071", "rpart", "randomForest"))
  1. You will need the following datasets: heights.rds, olive.rds, and TBnanostring.rds

Data Frames

A large proportion of data analysis challenges start with data stored in a data frame. For example, we stored the data for our motivating example in a data frame. You can access this dataset by loading TBNanostring.rds object in R:

TBnanostring <- readRDS("TBnanostring.rds")

In RStudio we can view the data with the View function:

View(TBnanostring)

Or in RMarkdown you can use the datatable function from the DT package:

datatable(TBnanostring)

You will notice that the TB status is found in the first column of the data frame, followed by the genes in the subsequent columns. The rows represent each individual patient.

Using the caret package

The caret package in R has several useful functions for building and assessing machine learning methods. It tries to consolidate many machine learning tools to provide a consistent syntax.

Using the height data

For a first example, we use the height data in dslabs:

heights <- readRDS("heights.rds")
datatable(heights)
boxplot(heights$height ~ heights$sex)

Partition dataset

The caret package includes the function createDataPartition that helps us generates indexes for randomly splitting the data into training and test sets:

set.seed(2007)
test_index <- createDataPartition(heights$sex, times = 1, 
                                  p = 0.5, list = FALSE)
test_set <- heights[test_index, ]
train_set <- heights[-test_index, ]

The argument times is used to define how many random samples of indexes to return, p is used to define what proportion of the data is represented by the index, and list is used to decide if we want the indexes returned as a list or not.

Simple predictor

Exploratory data analysis suggests we can because, on average, males are slightly taller than females:

heights %>% group_by(sex) %>% 
  summarize(mean(height), sd(height))
## # A tibble: 2 × 3
##   sex    `mean(height)` `sd(height)`
##   <fct>           <dbl>        <dbl>
## 1 Female           64.9         3.76
## 2 Male             69.3         3.61

Let’s try predicting with a simple approach: predict Male if height is within two standard deviations from the average male. The overall accuracy is 0.78 in the test set:

pred_test <- ifelse(test_set$height > 62, "Male", "Female") %>% 
  factor(levels = levels(test_set$sex))
confusionMatrix(pred_test, test_set$sex)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Female Male
##     Female     22   18
##     Male       97  388
##                                           
##                Accuracy : 0.781           
##                  95% CI : (0.7431, 0.8156)
##     No Information Rate : 0.7733          
##     P-Value [Acc > NIR] : 0.3607          
##                                           
##                   Kappa : 0.1836          
##                                           
##  Mcnemar's Test P-Value : 3.502e-13       
##                                           
##             Sensitivity : 0.18487         
##             Specificity : 0.95567         
##          Pos Pred Value : 0.55000         
##          Neg Pred Value : 0.80000         
##              Prevalence : 0.22667         
##          Detection Rate : 0.04190         
##    Detection Prevalence : 0.07619         
##       Balanced Accuracy : 0.57027         
##                                           
##        'Positive' Class : Female          
## 

Use the train function

The caret package currently includes 237+ different machine learning methods, which can be applied using the train function. These are summarized in the caret package manual.

Keep in mind that caret does not include the needed packages and, to implement a package through caret, you still need to install the library. For example:

height_glm <- train(sex ~ height, method = "glm", data=train_set)
confusionMatrix(predict(height_glm, train_set), 
                train_set$sex)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Female Male
##     Female     61   18
##     Male       58  388
##                                           
##                Accuracy : 0.8552          
##                  95% CI : (0.8222, 0.8842)
##     No Information Rate : 0.7733          
##     P-Value [Acc > NIR] : 1.679e-06       
##                                           
##                   Kappa : 0.5314          
##                                           
##  Mcnemar's Test P-Value : 7.691e-06       
##                                           
##             Sensitivity : 0.5126          
##             Specificity : 0.9557          
##          Pos Pred Value : 0.7722          
##          Neg Pred Value : 0.8700          
##              Prevalence : 0.2267          
##          Detection Rate : 0.1162          
##    Detection Prevalence : 0.1505          
##       Balanced Accuracy : 0.7341          
##                                           
##        'Positive' Class : Female          
## 

SVMs (Linear)

Generate dataset

First generate some data in 2 dimensions, and make them a little separated:

set.seed(10111)
x = matrix(rnorm(40), 20, 2)
y = rep(c(-1, 1), c(10, 10))
x[y == 1,] = x[y == 1,] + 1
plot(x, col = y + 3, pch = 19)

Apply SVM

We will use the e1071 package which contains the svm function that works on the dataframe (\(y\) needs to be a factor variable). Printing the svmfit gives its summary.

dat = data.frame(x, y = as.factor(y))
svmfit = svm(y ~ ., data = dat, kernel = "linear", cost = 10, scale = FALSE)
print(svmfit)
## 
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 10, scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  6

You can see that the number of support vectors is 6 - they are the points that are close to the boundary or on the wrong side of the boundary.

Plot SVM (version 1)

There’s a generic plot function for SVM that shows the decision boundary, as you can see below. It doesn’t seem there’s much control over the colors. It breaks with convention since it puts x2 on the horizontal axis and x1 on the vertical axis.

plot(svmfit, dat)

Plot SVM (version 2)

Or plotting it more cleanly:

make.grid = function(x, n = 75) {
  grange = apply(x, 2, range)
  x1 = seq(from = grange[1,1], to = grange[2,1], length = n)
  x2 = seq(from = grange[1,2], to = grange[2,2], length = n)
  expand.grid(X1 = x1, X2 = x2)
}
xgrid = make.grid(x)
ygrid = predict(svmfit, xgrid)
plot(xgrid, col = c("red","blue")[as.numeric(ygrid)], pch = 20, cex = .2)
points(x, col = y + 3, pch = 19)
points(x[svmfit$index,], pch = 5, cex = 2)

Unfortunately, the svm function is not too friendly, in that you have to do some work to get back the linear coefficients. The reason is probably that this only makes sense for linear kernels, and the function is more general. So let’s use a formula to extract the coefficients more efficiently. You extract \(\beta\) and \(\beta_0\), which are the linear coefficients.

beta = drop(t(svmfit$coefs)%*%x[svmfit$index,])
beta0 = svmfit$rho

Now you can replot the points on the grid, then put the points back in (including the support vector points). Then you can use the coefficients to draw the decision boundary using a simple equation of the form:

\[\beta_0+x_1\beta_1+x_2\beta_2=0\]

Now plotting the lines on the graph:

plot(xgrid, col = c("red", "blue")[as.numeric(ygrid)], pch = 20, cex = .2)
points(x, col = y + 3, pch = 19)
points(x[svmfit$index,], pch = 5, cex = 2)
abline(beta0 / beta[2], -beta[1] / beta[2])
abline((beta0 - 1) / beta[2], -beta[1] / beta[2], lty = 2)
abline((beta0 + 1) / beta[2], -beta[1] / beta[2], lty = 2)

SVM Nanostring data

Remember the PCA dimension reduction of the TB Nanostring dataset. The points are colored based on TB status.

Now let’s try an SVM on the PCs of the Nanostring data

# use only the first 2 PCs
dat = data.frame(y = pca_reduction$Condition, 
                 pca_reduction[,1:2])

fit = svm(y ~ ., data = dat, scale = FALSE, 
          kernel = "linear", cost = 10)
print(fit)
## 
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 10, scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  52

We can evaluate the predictor with a Confusion matrix:

confusionMatrix(dat$y,predict(fit,dat))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction TB LTBI
##       TB   69   10
##       LTBI  8   92
##                                           
##                Accuracy : 0.8994          
##                  95% CI : (0.8457, 0.9393)
##     No Information Rate : 0.5698          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7955          
##                                           
##  Mcnemar's Test P-Value : 0.8137          
##                                           
##             Sensitivity : 0.8961          
##             Specificity : 0.9020          
##          Pos Pred Value : 0.8734          
##          Neg Pred Value : 0.9200          
##              Prevalence : 0.4302          
##          Detection Rate : 0.3855          
##    Detection Prevalence : 0.4413          
##       Balanced Accuracy : 0.8990          
##                                           
##        'Positive' Class : TB              
## 

Plotting Nanostring data:

plot(fit,dat,PC2~PC1)

And plotting the results more cleanly:

SVMs (Non-linear)

Example 1 (polynomial kernel)

Apply polynomial SVM

Now let’s apply a non-linear (polynomial) SVM to our prior simulated dataset.

dat = data.frame(x, y = as.factor(y))
svmfit = svm(y ~ ., data = dat, 
             kernel = "polynomial", cost = 10, scale = FALSE)
print(svmfit)
## 
## Call:
## svm(formula = y ~ ., data = dat, kernel = "polynomial", cost = 10, 
##     scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  10 
##      degree:  3 
##      coef.0:  0 
## 
## Number of Support Vectors:  4

Plot results

Plotting the result:

Example 2 (radial basis kernel)

Load data

Here is a more complex example from Elements of Statistical Learning (from the mda R package), where the decision boundary needs to be non-linear and there is no clear separation.

rm(x,y)
data(ESL.mixture)
attach(ESL.mixture)
names(ESL.mixture)
## [1] "x"        "y"        "xnew"     "prob"     "marginal" "px1"      "px2"     
## [8] "means"

Plotting the data:

plot(x, col = y + 1)

Apply SVM (radial basis kernel)

Now make a data frame with the response \(y\), and turn that into a factor. We will fit an SVM with radial kernel.

dat = data.frame(y = factor(y), x)
fit = svm(factor(y) ~ ., data = dat, scale = FALSE, 
          kernel = "radial", cost = 5)
print(fit)
## 
## Call:
## svm(formula = factor(y) ~ ., data = dat, kernel = "radial", cost = 5, 
##     scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  5 
## 
## Number of Support Vectors:  103

Plotting the result

It’s time to create a grid and predictions. We use expand.grid to create the grid, predict each of the values on the grid, and plot them:

xgrid = expand.grid(X1 = px1, X2 = px2)
ygrid = predict(fit, xgrid)
plot(xgrid, col = as.numeric(ygrid), pch = 20, cex = .2)
points(x, col = y + 1, pch = 19)

Plotting with a contour:

Decision Trees

Olive oil

We will use a new dataset that includes the breakdown of the composition of olive oil into 8 fatty acids:

olive <- readRDS("olive.rds")
olive <- select(olive, -area) #remove the `area` column--don't use it
olive %>% datatable()

We will try to predict the region using the fatty acid composition values as predictors.

table(olive$region)
## 
## Northern Italy       Sardinia Southern Italy 
##            151             98            323

Boxplots

A bit of data exploration reveals that we should be able to do even better: note that eicosenoic is only in Southern Italy and linoleic separates Northern Italy from Sardinia.

olive %>% gather(fatty_acid, percentage, -region) %>%
  ggplot(aes(region, percentage, fill = region)) +
  geom_boxplot() +
  facet_wrap(~fatty_acid, scales = "free", ncol = 4) +
  theme(axis.text.x = element_blank(), legend.position="bottom")

Paritions

This implies that we should be able to build an algorithm that predicts perfectly! Let’s try plotting the values for eicosenoic and linoleic.

olive %>% 
  ggplot(aes(eicosenoic, linoleic, color = region)) + 
  geom_point(size=2) +
  geom_vline(xintercept = 0.065, lty = 2) + 
  geom_segment(x = -0.2, y = 10.54, xend = 0.065, yend = 10.54, 
               color = "black", lty = 2)

Decision tree

Let’s define a decision rule: If eicosenoic is larger than 0.065, predict Southern Italy. If not, then if linoleic is larger than \(10.535\), predict Sardinia, otherwise predict Northern Italy. We can draw this decision tree:

train_rpart <- train(region ~ ., method = "rpart", data = olive)
plot(train_rpart$finalModel, margin = 0.1)
text(train_rpart$finalModel, cex = 0.75)

Random Forests

Summarize data

For this example we will use the Iris data in R

datatable(iris)
iris %>% ggplot(aes(Petal.Width, Petal.Length, color = Species)) +
  geom_point()

Select train/test

set.seed(0)
test_index <- createDataPartition(iris$Species, times = 1, 
                                  p = 0.3, list = FALSE)
iris_test <- iris[test_index, ]
iris_train <- iris[-test_index, ]

Apply Random Forest

iris_classifier <- randomForest(Species~., data = iris_train, 
                    importance=T)
iris_classifier
## 
## Call:
##  randomForest(formula = Species ~ ., data = iris_train, importance = T) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 2.86%
## Confusion matrix:
##            setosa versicolor virginica class.error
## setosa         35          0         0  0.00000000
## versicolor      0         34         1  0.02857143
## virginica       0          2        33  0.05714286

Results summary

plot(iris_classifier)

importance(iris_classifier)
##                 setosa versicolor virginica MeanDecreaseAccuracy
## Sepal.Length  6.247640   8.383367  5.865249            10.618792
## Sepal.Width   4.490071   3.098781  3.295522             5.011737
## Petal.Length 22.641994  36.724624 25.382407            34.376688
## Petal.Width  21.561037  31.744843 24.616309            30.800022
##              MeanDecreaseGini
## Sepal.Length         5.527267
## Sepal.Width          1.786515
## Petal.Length        31.838258
## Petal.Width         30.130208
varImpPlot(iris_classifier)

Predictions

predicted_table <- predict(iris_classifier, iris_test[,-5])
confusionMatrix(predicted_table, iris_test[,5])
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         13         1
##   virginica       0          2        14
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9333         
##                  95% CI : (0.8173, 0.986)
##     No Information Rate : 0.3333         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9            
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.8667           0.9333
## Specificity                 1.0000            0.9667           0.9333
## Pos Pred Value              1.0000            0.9286           0.8750
## Neg Pred Value              1.0000            0.9355           0.9655
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.2889           0.3111
## Detection Prevalence        0.3333            0.3111           0.3556
## Balanced Accuracy           1.0000            0.9167           0.9333

Additional practice

The TBnanostring.rds dataset contains gene expression measurements in the blood for 107 TB-related genes for 179 patients with either active tuberculosis infection (TB) or latent TB infection (LTBI) from one of Dr. Johnson’s publications. When you Load these data into R ( TBnanostring <- readRDS("TBnanostring.rds")) the TB status is found in the first column of the data frame, followed by the genes in the subsequent columns. The rows represent each individual patient.

Here is a UMAP clustering of the dataset, and plot the result using ggplot. The points are colored based on TB status.

Split the dataset into “training” and “testing” sets using a 70/30 partition, using set.seed(0) and the createDataPartition function from the caret package (code is ‘hiding’ in the .Rmd file!). Apply the following machine learning methods to make a predictive biomarker to distinguish between the TB and control samples, use the caret package and cross validation to find the “finalModel” parameters to for each method.

Now, using the caret::train() function, apply the following machine learning methods to make a predictive biomarker to distinguish between the TB and control samples, use the caret package and cross validation to find the “finalModel” parameters to for each method. Provide any relevant/informative plots with your results.

  1. Split the dataset into “training” and “testing” sets using a 70/30 partition (use set.seed(0) and the caret::createDataPartition).
  2. Apply a Support Vector Machine to these data (try linear, radial, and polynomial kernels).
  3. Apply a Random Forest Model to these data.
  4. Compare the overall accuracy of the prediction methods for each of the machine learning tools in the previous problem. Which one performs the best?

(Note: the TBnanostring.Rmd and TBnanostring.html files provide suggested solutions for these analyses)

Session Info

sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] umap_0.2.10.0        randomForest_4.7-1.2 mda_0.5-5           
##  [4] class_7.3-23         gridExtra_2.3        e1071_1.7-16        
##  [7] lubridate_1.9.4      forcats_1.0.0        stringr_1.5.1       
## [10] dplyr_1.1.4          purrr_1.1.0          readr_2.1.5         
## [13] tidyr_1.3.1          tibble_3.3.0         tidyverse_2.0.0     
## [16] DT_0.33              caret_7.0-1          lattice_0.22-7      
## [19] ggplot2_3.5.2       
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1     timeDate_4041.110    farver_2.1.2        
##  [4] fastmap_1.2.0        pROC_1.18.5          digest_0.6.37       
##  [7] rpart_4.1.24         timechange_0.3.0     lifecycle_1.0.4     
## [10] survival_3.8-3       magrittr_2.0.3       compiler_4.5.1      
## [13] rlang_1.1.6          sass_0.4.10          tools_4.5.1         
## [16] utf8_1.2.6           yaml_2.3.10          data.table_1.17.8   
## [19] knitr_1.50           askpass_1.2.1        labeling_0.4.3      
## [22] htmlwidgets_1.6.4    reticulate_1.43.0    plyr_1.8.9          
## [25] RColorBrewer_1.1-3   withr_3.0.2          nnet_7.3-20         
## [28] grid_4.5.1           stats4_4.5.1         future_1.58.0       
## [31] globals_0.18.0       scales_1.4.0         iterators_1.0.14    
## [34] MASS_7.3-65          cli_3.6.5            rmarkdown_2.29      
## [37] generics_0.1.4       RSpectra_0.16-2      rstudioapi_0.17.1   
## [40] future.apply_1.20.0  reshape2_1.4.4       tzdb_0.5.0          
## [43] cachem_1.1.0         proxy_0.4-27         splines_4.5.1       
## [46] parallel_4.5.1       vctrs_0.6.5          hardhat_1.4.1       
## [49] Matrix_1.7-3         jsonlite_2.0.0       hms_1.1.3           
## [52] listenv_0.9.1        crosstalk_1.2.1      foreach_1.5.2       
## [55] gower_1.0.2          jquerylib_0.1.4      recipes_1.3.1       
## [58] glue_1.8.0           parallelly_1.45.0    codetools_0.2-20    
## [61] stringi_1.8.7        gtable_0.3.6         pillar_1.11.0       
## [64] htmltools_0.5.8.1    openssl_2.3.3        ipred_0.9-15        
## [67] lava_1.8.1           R6_2.6.1             evaluate_1.0.4      
## [70] png_0.1-8            bslib_0.9.0          Rcpp_1.1.0          
## [73] nlme_3.1-168         prodlim_2025.04.28   xfun_0.52           
## [76] ModelMetrics_1.2.2.2 pkgconfig_2.0.3